Skip to content

Background sessions on Mongo control plane (Main + Worker topology)#206

Open
akseljoonas wants to merge 15 commits intomainfrom
bg-sessions-mongo-control-plane
Open

Background sessions on Mongo control plane (Main + Worker topology)#206
akseljoonas wants to merge 15 commits intomainfrom
bg-sessions-mongo-control-plane

Conversation

@akseljoonas
Copy link
Copy Markdown
Collaborator

Summary

v1 of "background-running ml-intern": agent sessions survive SSE drops, browser close, and Main Space restarts. Frontend is unchanged — backend-only refactor.

  • Topology: 1 Main Space + N Worker Spaces, same Docker image, MODE=main|worker env var differentiates. Worker Spaces have no public HTTP — they consume a Mongo-backed work queue.
  • Universal control plane: All inter-process comms go through MongoDB. The in-memory submission_queue is replaced; EventBroadcaster is kept as an opt-in read-side cache attached only when this process holds the lease (so foreground SSE skips the change-stream round-trip).
  • Lease + heartbeat ownership: findOneAndUpdate CAS on sessions.lease. TTL = 30 s, renew every 10 s. Lease loss triggers requeue_claimed_for(holder_id) and drops the session locally so a Worker can pick up orphaned submissions.
  • Three migration triggers all converge on release_session_to_background(reason): SSE-drop grace period (default 180 s), manual POST /api/session/{id}/background, and Main lifespan shutdown for active turns. Each emits a migrating session_event the frontend can render as "reconnecting".
  • Worker idle eviction: 30-minute default; predicate is not is_in_tool_call AND not is_processing AND no pending_submissions in Mongo.

Acceptance drills (manual, post-deploy)

  1. Close laptop, come back: Start a session that launches a long-running tool (HF Job training), close the tab, wait 3+ minutes, reopen 30 minutes later. Frontend replays the missed events; pending approval is visible; approve and the session resumes.
  2. Main restart with active turn: Start a session mid-tool-call, force-restart Main (or docker restart). A Worker claims the lease within 30 s; the user reconnects to a fresh Main and sees the turn continuing without an interrupted event.

Both drills are runnable locally against docker run mongo:7 --replSet rs0 + 2 processes (MODE=main and MODE=worker). See the new "Background sessions deployment" section in AGENTS.md.

Test plan

  • 83 new unit tests added across 6 files, all passing
    • tests/test_session_persistence_lease.py (12) — atomic CAS, FIFO, requeue ordering preservation, idempotent backfill
    • tests/test_session_manager_lease.py (12) — holder identity, claim wiring, heartbeat renewal + loss handling, lease release in finally
    • tests/test_session_manager_submissions.py (18) — enqueue paths, holder/non-holder interrupt, FIFO drain, inline interrupt/shutdown handling, poison containment
    • tests/test_sse_holder_overlay.py (7) — holder fast path, non-holder change stream, polling fallback on PyMongoError, subscriber counter attach/detach
    • tests/test_lifespan_grace_sweep.py (8) — release_session_to_background event emission, grace-sweep predicate matrix, /background route 200/404
    • tests/test_worker_idle_eviction.py (13) — idle predicate matrix, claim_dormant_session user-bypass + lease-taken paths, claim-tick skips already-held
  • Pre-existing unit suite still green (248 passing; two unrelated pre-existing test_doom_loop failures verified via git stash)
  • ruff check clean on every file modified by this PR (pre-existing E402/F401/F841 in main.py and routes/agent.py predate this branch)
  • Manual: drill 1 (close laptop) on the deployed Spaces
  • Manual: drill 2 (Main restart with active turn) on the deployed Spaces
  • Manual: pre-deploy baseline db.sessions.count({runtime_state: "processing"}) to capture blast-radius

Deployment notes (see AGENTS.md for the full runbook)

  1. Roll Workers first with the new code. They start consuming pending_submissions against any session whose lease has expired, but won't see anything new until Main also rolls.
  2. Then roll Main. MongoSessionStore.init() runs an idempotent backfill: sessions with last_active_at > now-1h and no lease get an empty lease so the next CAS succeeds; older sessions flip to runtime_state: idle (still recoverable, never ended).
  3. The lifespan shutdown sweep on the OLD Main releases active-turn leases via release_session_to_background(reason="main_shutdown"); Workers pick them up within 30 s.

Required env vars:

Var Default Effect
MODE main worker flips to the worker-loop entrypoint
MONGODB_URI unset Required for the control plane; without it falls back to NoopSessionStore (CLI compatibility)
GRACE_PERIOD_SECONDS 180 SSE-drop grace before background migration
IDLE_EVICTION_SECONDS 1800 Worker idle eviction TTL

MONGODB_URI must point at a replica set (Atlas works out of the box) for change-stream support. Single-node Mongo falls back to 500 ms polling.

Process & follow-ups

  • Design produced via /deep-dive (8 interview rounds + 3-lane causal trace) and refined via /plan --consensus --direct (Architect + Critic over 2 iterations, both APPROVED). Spec, trace, and consensus plan are committed under .omc/specs/ and .omc/plans/ for posterity.
  • Final architect verification: APPROVE WITH MINOR CLEANUP. Five LOW-severity follow-ups deferred to v1.1:
    1. Heartbeat misleading-WARN guard when session drops mid-iteration
    2. delete_session requeue ordering (race window for clean-delete UI)
    3. claim_dormant_session lease release on rebuild failure (TTL recovers in 30 s)
    4. pending_submissions payload size guard (16 MB BSON ceiling)
    5. Mongo-outage chaos test in CI

Risks

  • Mongo replica set is now a hard production requirement (was advisory). Single-node deployments degrade to polling and break _lease_heartbeat_loop renewals on outage — heartbeat returns None, _on_lease_lost cleans up gracefully, clients reconnect.
  • Latency: foreground SSE consumers stay at today's profile (broadcaster fast path). Cross-process consumers see 50–200 ms p50 from the change-stream tail.
  • Worker capacity at v1: MAX_SESSIONS_PER_WORKER × N Workers. Scaling is operational (deploy more Workers); the lease design supports it without code changes.

Out of scope (deferred)

  • Push notifications when away (NotificationGateway stays opt-in, only Slack provider wired)
  • Async-dashboard UX rebuild
  • 10k concurrent simultaneous load test (design accommodates it; v1 ships at 1 Main + 2 Workers)
  • Tool-call double-execution / Worker-crash idempotency (Workers assumed stable in v1)

Spec, trace, and consensus implementation plan for v1 of background-running
ml-intern: sessions survive SSE drops, browser close, and Main Space
restarts via a Main+Workers HF Space topology with MongoDB as the universal
control plane. Lease+heartbeat ownership, configurable grace period, and
Worker idle eviction. Frontend stays unchanged.

Plan was reviewed by architect + critic to consensus over 2 iterations.
Extends MongoSessionStore with the building blocks of the lease+heartbeat
control plane:

- claim_lease / renew_lease / release_lease: atomic CAS via
  findOneAndUpdate, with TTL semantics and idempotent release.
- enqueue_pending_submission / claim_pending_submission /
  mark_submission_done / requeue_claimed_for: FIFO submission queue with
  handover-safe requeue (preserves created_at across lease changes).
- change_stream_pending_submissions / change_stream_events: real-time
  tailers with PyMongoError signalling for the polling fallback path.
- make_holder_id(mode): "{mode}:{hostname}:{uuid7-or-uuid4-suffix}".

MongoSessionStore.init() now runs an idempotent backfill on startup:
sessions with last_active_at within the last hour get an empty lease so
the next CAS succeeds; older sessions flip to runtime_state=idle (still
recoverable, not ended).

NoopSessionStore mirrors the new surface so the public CLI still works
without MongoDB.

Tests: 12 new in tests/test_session_persistence_lease.py covering CAS
exactly-once under concurrency, FIFO claim order, requeue ordering
preservation, recency-split backfill, idempotent re-init, and holder-id
format. Uses mongomock-motor (added to dev deps).
Wires the lease+heartbeat ownership model into SessionManager so each
process picks a stable holder_id at startup, claims a Mongo lease before
instantiating any runtime session, renews held leases on a TTL/3
cadence, and cleans up correctly when ownership transfers.

Highlights:
- __init__ reads MODE env (main|worker, defaults main on invalid with
  WARN), computes self.mode and self._holder_id via make_holder_id().
- start() launches _lease_heartbeat_loop (10 s = TTL/3).
- _on_lease_lost calls requeue_claimed_for(holder_id), pops the session,
  cancels its task without awaiting, and logs WARN. This is the Step 1.5
  requeue path so a Worker can pick up orphaned submissions.
- create_session and ensure_session_loaded claim a lease before starting
  or restoring a session; ensure_session_loaded refuses to restore when
  another process holds the lease.
- _run_session, shutdown_session, and delete_session release the lease
  on exit.
- close() cancels and awaits the heartbeat task.

Tests: 12 new in tests/test_session_manager_lease.py covering holder
identity, claim wiring, heartbeat renewal, Noop fallthrough, lease-loss
requeue+drop, release in _run_session finally, and close-time cancel.
Adjusted unit/test_session_manager_persistence.py helper to seed the new
__init__ attributes; 35 tests total pass.

The submission-consume loop, EventBroadcaster, lifespan, and routes are
unchanged (US-004 / US-005 / US-006 territory).
Rewrites the central session loop so user input, approvals, interrupts,
and shutdowns flow through the durable pending_submissions collection
instead of an in-process asyncio.Queue. This is the substrate change
that lets a Worker pick up a session another process originally created.

backend/session_manager.py:
- Drop submission_queue field from AgentSession and its three
  construction sites; add last_submission_at for US-007 idle eviction.
- _consume_submissions: change-stream-first consumer with a 500ms
  polling fallback when the deployment is not a replica set.
- _drain_and_process: FIFO atomic claim loop. Handles op_type
  "interrupt" (session.cancel()) and "shutdown" (session.is_running=
  False) inline; everything else dispatches through process_submission.
  Marks each claimed submission done in finally so poison rows can't
  redeliver.
- Replace _run_session's queue-poll loop with await
  _consume_submissions; the session_manager is the only thing reaching
  into pending_submissions.
- Rewrite submit / submit_user_input / submit_approval / undo / compact
  / shutdown_session as thin enqueue calls via a shared
  _enqueue_or_false helper that checks in-memory presence OR
  store.load_session() so cross-process delivery works.
- Rewrite interrupt with the holder branch: in-memory holder calls
  session.cancel() directly; non-holders enqueue op_type "interrupt"
  for the actual holder to consume.

is_in_tool_call flag (Critic MAJOR #5):
- agent/core/session.py: new boolean on Session.
- agent/core/agent_loop.py: wraps both asyncio.gather tool batches
  (non-approval and post-approval) with True/False around the await.
- agent/tools/research_tool.py: same wrapper around the sub-agent's
  tool_router.call_tool.

Tests:
- tests/test_session_manager_submissions.py (new, 18 tests): enqueue
  paths, holder fast-path interrupt, non-holder enqueued interrupt,
  FIFO drain order, inline interrupt/shutdown handling, poison-
  submission containment, store-disabled fallback, _build_operation.
- Existing fixtures (test_session_manager_lease,
  test_messaging, test_session_manager_persistence) updated to drop
  the submission_queue kwarg and the now-removed positional argument
  to _run_session.

55 tests pass in the lease+submission verification suite (12 + 12 +
18 + 6 + 5 + 2 inline). Broader unit suite remains green at 248 (two
pre-existing test_doom_loop failures verified unrelated via git stash).
Keeps EventBroadcaster as an opt-in read-side cache attached only when
this process holds the session's lease. Non-holders tail Mongo
change_stream_events with a 500 ms load_events_after polling fallback
on PyMongoError or NotImplementedError. Writes still go through Mongo
first via Session.send_event -> append_event (durability invariant
verified, no change needed in agent/core/session.py).

backend/session_manager.py:
- AgentSession gains holder_id (set when lease is claimed).
- SessionManager gains _subscriber_counts and _no_subscriber_since
  dicts plus _attach_subscriber / _detach_subscriber helpers; both
  SSE transports call them at attach + finally-detach. Used by the
  upcoming grace-period sweeper.

backend/routes/agent.py:
- _sse_response now takes (session_id, agent_session, *, replay_events,
  after_seq). Generator does the replay-then-live pattern in two
  branches: holder subscribes to in-process broadcaster; non-holder
  opens change_stream_events from last_seen_seq, with a 500 ms poll
  fallback. Keepalives preserved on both paths.
- chat_sse no longer subscribes pre-submit (Mongo durability makes the
  early-subscribe-to-not-miss-events trick unnecessary).
- subscribe_events updated to the new signature.

Tests: tests/test_sse_holder_overlay.py (new, 7): holder fast path,
non-holder change-stream, fallback-to-poll on PyMongoError, subscriber
counter attach/detach on both paths, replay phase respects after_seq,
terminal event in replay ends stream cleanly.

62 tests pass cumulative across the lease/submission/SSE suites.
Wires the three Spec migration triggers into the Main Space process:

1. Main shutdown (deploy): lifespan hook iterates self.sessions and
   for any session in runtime_state="processing", calls
   release_session_to_background(reason="main_shutdown") before
   session_manager.close().
2. SSE-drop grace period: a 30 s sweeper task scans sessions held by
   this process. When _no_subscriber_since[sid] is older than
   GRACE_PERIOD_SECONDS (default 180) AND the session has work
   (is_processing, is_in_tool_call, or pending submissions in Mongo),
   it auto-releases. Idle-with-no-work sessions are left alone so an
   open-and-walk-away tab doesn't get backgrounded for nothing.
3. Manual button: POST /api/session/{id}/background route reuses
   _check_session_access and calls the same helper.

All three paths converge on release_session_to_background(sid, reason),
which:
- Emits a `migrating` session_event (durably persisted via append_event
  so non-holder SSE readers see it on the next change-stream tick),
- Calls store.requeue_claimed_for(self._holder_id) so a Worker can
  pick up our claimed-but-uncompleted submissions,
- Calls store.release_lease(sid, self._holder_id),
- Drops the session from self.sessions and cancels its agent task.

Tests: tests/test_lifespan_grace_sweep.py (new, 8) covering the helper,
the grace-sweep predicate matrix (subs present, grace+work, grace
without work), and the route. _bare_manager fixtures in two existing
test files updated to set manager._grace_sweep_task = None so close()
doesn't AttributeError.

70 tests pass cumulatively across the lease/submission/SSE/grace suites.
Implements the second deployable surface — Worker Spaces — and the
session-eviction policy that keeps held sessions from accumulating
forever in a Worker process.

backend/start.sh:
- New MODE=worker branch as first action, exec python -m worker.
- MODE=main path is unchanged: existing port-conflict graceful
  exit-0 hack for HF Spaces dev mode preserved verbatim.

backend/worker.py (new): three-line shim that calls
asyncio.run(worker_loop()) at module entry. Uses `from main import
worker_loop` to match the existing `WORKDIR=/app/backend` contract.

backend/main.py:
- worker_loop(): boots session_manager (which now starts heartbeat,
  grace sweep, AND idle eviction tasks), then runs _worker_claim_tick
  on a 1 s cadence with 2 s back-off on errors. Closes cleanly on
  cancellation.
- _worker_claim_tick(): scans db.pending_submissions for status=
  "pending" docs, skips session_ids we already hold, and calls
  claim_dormant_session for the rest. Lease CAS handles contention.

backend/session_manager.py:
- Refactor: _rebuild_agent_session_from_store() extracted from
  ensure_session_loaded so claim_dormant_session can reuse it.
- claim_dormant_session(session_id): for the worker_loop's claim
  path. Bypasses the user-ownership gate (process-level trust); the
  lease CAS still enforces one-holder-at-a-time.
- _idle_eviction_loop: 60 s tick. Releases sessions held by us
  whose last_submission_at is older than IDLE_EVICTION_SECONDS
  (default 1800) AND not is_processing AND not is_in_tool_call AND
  no pending_submissions in Mongo. No migrating event - nobody is
  watching an idle session. start()/close() manage the task.
- last_submission_at switched from asyncio.get_event_loop().time()
  to time.time() so the idle predicate uses the same wall clock as
  _no_subscriber_since.

Tests: tests/test_worker_idle_eviction.py (new, 13): idle predicate
matrix, claim_dormant_session bypass + lease-taken paths, claim-tick
skips already-held sessions. _bare_manager() fixtures in three
existing files updated to set manager._idle_eviction_task = None.

83 tests pass cumulatively.
Adds 11 observability log lines across the new control plane and a
deployment-runbook section to AGENTS.md. No behavior change.

backend/session_manager.py (10 log points):
- lease_claim INFO at all three lease-claim sites (create_session,
  ensure_session_loaded, claim_dormant_session).
- lease_release INFO at all three release sites (release_session_to_
  background, _run_session finally, _idle_eviction_loop).
- requeue_claimed INFO when count > 0 (_on_lease_lost,
  release_session_to_background).
- migrating_emitted INFO in release_session_to_background after the
  send_event.
- pending_submission_lag DEBUG in _drain_and_process; only logs when
  lag > 100 ms to avoid noise.

backend/routes/agent.py (1 log point):
- replay_event_count INFO at the end of the replay phase in
  _sse_response, on both the terminal-event and phase-complete paths.

AGENTS.md: new "Background sessions deployment" section covers:
- Two-tier topology (Main + Worker(s) sharing one Docker image, MODE
  env var differentiates).
- Deploy ordering (Workers first, Main last; backfill runs in
  MongoSessionStore.init).
- Blast-radius query to capture pre-deploy.
- Env vars table (MODE, MONGODB_URI, GRACE_PERIOD_SECONDS,
  IDLE_EVICTION_SECONDS).
- Local two-process stack: docker mongo replica set + Main + Worker.
- Chaos test (docker pause/unpause Mongo) verifying change-stream
  resume token.
- Observability grep patterns for the new log lines.

83 tests pass. Final story of the bg-sessions-mongo-control-plane
refactor.
Five small cleanups identified during the post-Ralph anti-slop pass.
No behavior change; 83 tests remain green.

- agent/core/session_persistence.py: drop redundant `return None` from a
  -> None function (release_lease).
- backend/session_manager.py: collapse the duplicate WARN+INFO log pair
  in _on_lease_lost into a single WARN; switch to %-style formatting
  consistent with the surrounding file. Drop the impossible
  `if not agent_session: return` guard in _run_session — the caller
  inserts the session before creating the task, so the dict lookup
  cannot miss; direct access now surfaces any future invariant break.
- agent/core/agent_loop.py: remove two comments that just restated the
  next line of well-named code.
- agent/tools/research_tool.py: drop the in-loop `import json as _json`
  shadow; the module-level json import is already in scope.
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

PR Review: Background sessions on Mongo control plane

Tasks

  • Gather context (CLAUDE.md, REVIEW.md)
  • Read git diff and key changed files
  • Review session persistence / lease logic
  • Review session manager changes
  • Review routes / main.py changes
  • Review worker entrypoint and start.sh
  • Review test coverage
  • Post findings

View job run

- .gitignore: add .omc/, .omx/, AGENTS.md, uv.lock so transient
  AI-tooling artifacts and locally-generated lockfiles aren't tracked.
- docs/deployment.md (new): the background-sessions deployment runbook
  previously embedded in AGENTS.md, now a standalone doc that doesn't
  require AGENTS.md to be in the tree.
- AGENTS.md: removed (was tracked); contents superseded by
  docs/deployment.md for the parts that belong in the repo.
- uv.lock: removed; regenerated locally via `uv sync`.
- .omc/specs, .omc/plans, .omc/prd.json, .omc/progress.txt: removed;
  these were design-pipeline artifacts. The PR description carries the
  durable summary.
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

PR Review: Background sessions on Mongo control plane

Tasks

  • Gather context (CLAUDE.md, REVIEW.md)
  • Read git diff and key changed files
  • Review session persistence / lease logic
  • Review session manager changes
  • Review routes / main.py changes
  • Review worker entrypoint and start.sh
  • Review test coverage
  • Post findings

View job run

Reverts the AGENTS.md and uv.lock changes from this branch so the PR
diff is empty for both files. The deployment runbook stays in
docs/deployment.md.

- pyproject.toml: drop mongomock-motor>=0.0.36 from dev deps. The
  lease-control-plane tests already use pytest.importorskip, so they
  skip gracefully when the package is not installed; devs who want
  to run them install mongomock-motor manually.
- .gitignore: drop AGENTS.md and uv.lock entries (no-ops on already-
  tracked files); keep .omc/ and .omx/ entries.
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

Claude finished @akseljoonas's task in 6m 36s —— View job


PR Review: Background sessions on Mongo control plane

Tasks

  • Gather context (CLAUDE.md, REVIEW.md)
  • Read git diff and key changed files
  • Review session persistence / lease logic
  • Review session manager changes
  • Review routes / main.py changes
  • Review worker entrypoint and start.sh
  • Review test coverage
  • Post findings

2 P0, 3 P1
Verdict: changes requested

What I checked

  • Lease CAS logic: claim_lease / renew_lease / release_lease in session_persistence.py
  • Full reconnect path from _check_session_accessensure_session_loaded_sse_response
  • _on_lease_lost scope and which submissions requeue_claimed_for touches
  • Heartbeat error handling (renew_lease on PyMongoError)
  • SSE holder/non-holder branching in _sse_response and whether the non-holder path is reachable via HTTP
  • Worker claim tick: _worker_claim_tick query and session discovery
  • Lifespan shutdown sweep: active-turn handover to Workers
  • shutdown_session lock scope with external I/O inside
  • New /background route auth guard
  • _backfill_lease_state migration safety
  • Unit test coverage (read test stubs for test_sse_holder_overlay.py, test_session_manager_lease.py)

P0 — Reconnect to Worker-held session returns 404 (acceptance drill 2 is broken)

ensure_session_loaded requires claiming the lease before returning a session object. When a Worker holds the session's lease, claim_lease returns None (filter lease.expires_at < now OR holder_id IS NULL doesn't match a fresh Worker-held lease). ensure_session_loaded returns None at session_manager.py:843–847, and _check_session_access raises HTTP 404 at routes/agent.py:190–192.

This blocks every endpoint — including GET /api/events/{session_id} (the SSE reconnect endpoint) and POST /api/chat/{session_id}. The non-holder Mongo change-stream slow path in _sse_response (routes/agent.py:799–861) is unreachable via HTTP for Worker-held sessions: _check_session_access raises 404 before _sse_response is ever called.

The unit tests in test_sse_holder_overlay.py call _sse_response directly with a pre-constructed AgentSession(holder_id="other:host:zzz") that is already in manager.sessions, so they bypass _check_session_access entirely and do not catch this gap.

Acceptance drill 2 states: "A Worker claims the lease within 30 s; the user reconnects to a fresh Main and sees the turn continuing." This cannot happen because Main returns 404 for any session whose lease is currently held by a Worker.

Fix direction: ensure_session_loaded (and/or _check_session_access) needs a read-only path for sessions where the lease is held by another process. When claim_lease fails because another holder owns an unexpired lease, the function could return a lightweight stub AgentSession with holder_id set to the current holder's ID (read from the Mongo doc) and broadcaster=None. _sse_response already branches correctly on is_holder and handles agent_session with a foreign holder_id via the Mongo change-stream path — the non-holder code path works as designed; the problem is only that it is never reached. Fix this →


P0 — _on_lease_lost requeues in-flight submissions for ALL held sessions, not just the one that lost its lease

_on_lease_lost(session_id) at session_manager.py:270 calls:

requeued = await store.requeue_claimed_for(self._holder_id)

requeue_claimed_for (session_persistence.py:707–710) runs:

await self.db.pending_submissions.update_many(
    {"status": "claimed", "claimed_by": holder_id},
    ...
)

This flips every claimed submission for this holder back to pending — across all sessions this process holds, not just session_id.

Concrete double-execution scenario: Main holds sessions A and B. B is mid-tool-call: claim_pending_submission has already set its submission to claimed. A's renew_lease returns None (either because another process took it, or because of a transient PyMongoError — see P1 below). _on_lease_lost(A) fires, calls requeue_claimed_for(Main._holder_id), and resets B's in-flight submission to pending. A Worker claims B's submission and starts executing it. Main's _drain_and_process finishes processing the same submission and calls mark_submission_done. Two processes have now executed the same submission against the same session context.

Fix direction: Pass session_id to requeue_claimed_for and filter on it: {"status": "claimed", "claimed_by": holder_id, "session_id": session_id}. Fix this →


P1 — renew_lease returns None on PyMongoError, which the heartbeat loop treats identically to "lease stolen"

session_persistence.py:614–617:

except PyMongoError as e:
    logger.warning(...)
    return None

session_manager.py:249–254:

renewed = await store.renew_lease(session_id, self._holder_id, ttl_s=30)
if renewed is None:
    # Lease lost — someone else holds it now.
    await self._on_lease_lost(session_id)

A transient Mongo network error fires _on_lease_lost, cancels the session's task, and (because of P0 #2) requeues in-flight submissions for all concurrently held sessions — on a temporary outage, not an actual lease theft.

Fix: renew_lease should raise (or return a sentinel distinct from None) on PyMongoError so the heartbeat loop can skip and retry rather than treating it as a lost lease. Fix this →

P1 — release_lease leaves holder_id intact; heartbeat can re-acquire a just-released lease

release_lease (session_persistence.py:624–630) sets lease.expires_at = now but does NOT clear lease.holder_id. After release_session_to_background releases the lease and pops the session from self.sessions (session_manager.py:332–333), the heartbeat loop may have already snapshotted session_ids (between lines 246–248) before the pop. The heartbeat then calls renew_lease(session_id, holder_id), whose filter matches {"_id": session_id, "lease.holder_id": holder_id} — the holder_id was not cleared. This re-extends the lease by 30 s, blocking the Worker from claiming the session for up to one full TTL after the intended handover. Fix this →

P1 — shutdown_session holds _lock for up to 5 seconds while awaiting task completion

session_manager.py:1446–1462 acquires self._lock and then calls both await self._store().release_lease(...) (external I/O) and await asyncio.wait_for(agent_session.task, timeout=5.0) inside the locked region. The lock is held for up to 5 seconds, serializing every other lock-requiring operation (heartbeat snapshot, release_session_to_background, grace-period sweep). Fix: release the lock before awaiting the task — move release_lease and wait_for outside async with self._lock. Fix this →


Overall: the core data-plane primitives (CAS lease, FIFO claim, change-stream tail, requeue) are well-designed and the test suite is thorough. The P0s are both in the boundary between the persistence layer and the session/route layer — they are fixable with targeted changes. The non-holder SSE path is fully implemented and just needs to be made reachable.

…trol-plane

# Conflicts:
#	agent/core/agent_loop.py
#	backend/session_manager.py
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

Claude finished @akseljoonas's task in 1m 47s —— View job


PR Review: Background sessions on Mongo control plane

Tasks

  • Gather context (CLAUDE.md, REVIEW.md)
  • Check P0 Compacting context #1: reconnect-to-Worker-held session (ensure_session_loaded)
  • Check P0 Copying mcp tools #2: _on_lease_lost requeues all sessions (requeue_claimed_for scope)
  • Check P1 fixes (renew_lease sentinel, release_lease holder_id, shutdown_session lock)
  • Post findings

2 P0 still open (unchanged from prior review), 3 P1 not yet addressed
Verdict: changes requested

What I checked

  • ensure_session_loaded (session_manager.py:828–878) — lease-claim path when another holder owns the session
  • _on_lease_lost + requeue_claimed_for (session_manager.py:263–284, session_persistence.py:710–729) — scope of submission requeue
  • renew_lease error handling (session_persistence.py:628–630) — PyMongoError sentinel
  • release_lease update (session_persistence.py:638–641) — whether holder_id is cleared
  • shutdown_session lock scope (session_manager.py:1478–1491) — I/O inside lock
  • Heartbeat snapshot/iteration loop (session_manager.py:240–261) — no changes

P0 — Reconnect to Worker-held session returns 404 (unchanged)

ensure_session_loaded (session_manager.py:874–878) still returns None when claim_lease returns None:

if claimed is None:
    logger.info(
        f"Refusing restore of {session_id}: lease held by another process"
    )
    return None

_check_session_access in routes/agent.py raises HTTP 404 on None. The non-holder change-stream slow path in _sse_response remains unreachable via HTTP for any session currently held by a Worker. Acceptance drill 2 is still broken. Fix this →

P0 — _on_lease_lost requeues in-flight submissions for ALL held sessions (unchanged)

_on_lease_lost (session_manager.py:272) still calls:

requeued = await store.requeue_claimed_for(self._holder_id)

requeue_claimed_for (session_persistence.py:719–724) still filters only on {"status": "claimed", "claimed_by": holder_id} — no session_id predicate. A single lease loss triggers a broad requeue across every session this holder owns, enabling the double-execution scenario described in the prior review. Fix this →


The three P1s from the prior review (renew_lease PyMongoError sentinel, release_lease holder_id clear, shutdown_session lock scope) are also still open but not re-flagged here per re-review policy.

Five correctness fixes flagged by an automated PR review on the
bg-sessions branch:

P0 — reconnect to a Worker-held session returned 404
ensure_session_loaded called claim_lease and returned None when
another holder owned the lease, which made _check_session_access
raise HTTP 404 for every session in background mode. The non-holder
SSE slow path was unreachable. Fix: when claim_lease fails because
another process holds the lease, build a stub AgentSession with
holder_id=<foreign> and explicit None for session/tool_router/
broadcaster, and return it without inserting into self.sessions.
The SSE generator already branches on
holder_id == session_manager._holder_id and falls through to the
Mongo change-stream tail. Made AgentSession.session and
AgentSession.tool_router Optional to support stubs.

P0 — _on_lease_lost requeued submissions for ALL held sessions
requeue_claimed_for(holder_id) ran update_many filtered only on
holder, flipping every claimed submission this Main owned. A
transient lease loss on session A could cause Worker double-execution
of session B's in-flight submission. Fix: requeue_claimed_for now
takes an optional session_id; _on_lease_lost passes it. The other
call sites (release_session_to_background, lifespan shutdown sweep)
keep the no-arg behavior on purpose.

P1 — renew_lease swallowed PyMongoError as None
A transient Mongo network blip looked identical to "lease stolen"
and triggered _on_lease_lost. Fix: renew_lease no longer catches
PyMongoError; the heartbeat loop wraps the call in try/except,
logs a warning, and continues to the next session. Only
renewed is None (CAS mismatch — real lease theft) calls
_on_lease_lost.

P1 — release_lease left holder_id intact
A heartbeat tick that snapshotted session_ids before the
release-lease pop could re-extend the lease 30 s into the future
because the renew CAS still matched on holder_id. Fix:
release_lease now sets lease.holder_id to null alongside
lease.expires_at = now.

P1 — shutdown_session held _lock during external I/O
The lock was held while awaiting release_lease (Mongo I/O) and
asyncio.wait_for(task, timeout=5.0). Fix: scope the lock to just
the dict pop; release_lease and the task drain run unlocked.

Tests: 9 new across tests/test_session_persistence_lease.py and
tests/test_session_manager_lease.py covering each fix and the
backward-compat path for requeue_claimed_for. 94 tests pass on the
target suite (up from 85); 302 in the broader unit run.
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

PR Review: Background sessions on Mongo control plane

Tasks

  • Gather context (CLAUDE.md, REVIEW.md)
  • Check P0 Compacting context #1: ensure_session_loaded reconnect path for Worker-held sessions
  • Check P0 Copying mcp tools #2: _on_lease_lost / requeue_claimed_for scope
  • Check P1 fixes (renew_lease sentinel, release_lease holder_id, shutdown_session lock)
  • Post findings

View job run

Removes the six new test files added across the bg-sessions stories
(test_session_persistence_lease, test_session_manager_lease,
test_session_manager_submissions, test_sse_holder_overlay,
test_lifespan_grace_sweep, test_worker_idle_eviction) and restores
tests/unit/test_messaging.py and tests/unit/test_session_manager_persistence.py
to their state on main. The tests will land in a follow-up PR so
this one stays scoped to the production refactor.
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

Claude finished @akseljoonas's task in 5m 21s —— View job


Claude finished @akseljoonas's task —— View job


PR Review: Background sessions on Mongo control plane

Tasks

  • Gather context (CLAUDE.md, REVIEW.md)
  • Check P0 Compacting context #1: ensure_session_loaded reconnect path for Worker-held sessions
  • Check P0 Copying mcp tools #2: _on_lease_lost / requeue_claimed_for scope
  • Check P1 fixes (renew_lease sentinel, release_lease holder_id, shutdown_session lock)
  • Post findings

1 P0
Verdict: changes requested

What I checked

  • ensure_session_loaded stub construction (session_manager.py:892–933) — lease-claim failure path, stub fields, insertion guard
  • _check_session_access + _enforce_gated_model_quota interaction (routes/agent.py:179–196, 139–157) — what stub flows trigger with session=None
  • _sse_response is_holder branch (routes/agent.py:790–794) — whether stub reaches the non-holder change-stream path
  • _on_lease_lost + requeue_claimed_for (session_manager.py:281–307, session_persistence.py:719–750) — session_id scope fix
  • renew_lease PyMongoError propagation (session_persistence.py:615–633) + heartbeat catch (session_manager.py:265–270)
  • release_lease holder_id clear (session_persistence.py:635–652)
  • shutdown_session lock scope (session_manager.py:1521–1555)

Acknowledged fixes from prior review

All five prior findings are addressed:

  • P0 Compacting context #1 (reconnect to Worker-held session): ensure_session_loaded now returns a stub AgentSession with holder_id=foreign_holder and broadcaster=None instead of None. _sse_response correctly takes the non-holder change-stream slow path because is_holder evaluates to False (broadcaster is None). SSE reconnect now works. ✓
  • P0 Copying mcp tools #2 (_on_lease_lost requeues all sessions): _on_lease_lost now passes session_id=session_id to requeue_claimed_for (session_manager.py:292–293); requeue_claimed_for filters on that field when provided (session_persistence.py:737–738). Double-execution vector is closed. ✓
  • P1 (renew_lease PyMongoError): renew_lease now propagates PyMongoError; the heartbeat loop catches it at session_manager.py:265 and skips with continue, never calling _on_lease_lost. ✓
  • P1 (release_lease holder_id not cleared): release_lease now sets lease.holder_id: None alongside lease.expires_at: now (session_persistence.py:649), closing the heartbeat re-acquire window. ✓
  • P1 (shutdown_session lock scope): Only the sessions.get lookup runs inside self._lock; release_lease and wait_for(task) run outside it (session_manager.py:1538–1553). ✓

P0 — Stub with session=None crashes _enforce_gated_model_quota on submit/chat endpoints

The stub returned by ensure_session_loaded for Worker-held sessions is constructed with session=None (session_manager.py:922). _enforce_gated_model_quota (routes/agent.py:156) unconditionally dereferences agent_session.session.config.model_name:

# routes/agent.py:154-156
if agent_session.claude_counted:
    return
model_name = agent_session.session.config.model_name  # AttributeError if session is None

The stub's claude_counted defaults to False (not read from the Mongo doc), so the early-return guard does not fire. Every invocation of _enforce_gated_model_quota against a stub raises AttributeError, which FastAPI surfaces as HTTP 500.

Affected paths (all call _check_session_access then _enforce_gated_model_quota):

  • POST /api/submit (routes/agent.py:598–599) — user submitting text to an ongoing Worker turn
  • POST /api/chat/{session_id} (routes/agent.py:641–654) — combined submit+stream
  • POST /api/approve reaches _check_session_access but does not call _enforce_gated_model_quota, so it is not affected by this specific crash — but it would still return a stub that submit_approval would then handle correctly via the Mongo queue.

Scenario (post-drill-2): Worker holds session S. User reconnects from a fresh Main — SSE works. User then sends a new message: POST /api/submitensure_session_loaded returns stub (session=None) → _enforce_gated_model_quota crashes → HTTP 500.

Fix direction: In _enforce_gated_model_quota, add a guard for session=None — stubs represent sessions executing on a remote Worker and do not need local quota enforcement (the Worker already enforced quota when it built the real session). A one-line guard suffices:

if agent_session.session is None:
    return  # stub for remotely-held session; quota enforced by holder

Alternatively, read claude_counted from the loaded Mongo doc in the stub constructor so the guard at line 154 fires for already-charged sessions. Fix this →

Wires the manual handoff path into the existing React UI. The backend
route POST /api/session/{id}/background already exists; this is purely
frontend.

Behavior:
- A moon IconButton appears in the composer when the Main process
  holds the lease and the agent has work in flight (isProcessing or
  pending approval). Click POSTs to the background route. Hidden once
  the session is already backgrounded.
- The migrating SSE event is now a recognized EventType. The transport
  emits finish-step + finish (mirroring the interrupted/shutdown
  pattern) so the AI SDK closes the stream cleanly, then triggers
  reconnectToStream so SSE switches from the holder fast-path to the
  Mongo change-stream slow-path. The user keeps watching live.
- ActivityStatus gains a 'migrating' variant. ActivityStatusBar shows
  "Running in background".
- SessionMeta gains an isBackgrounded flag, set from the migrating
  event handler and (defensively) from the click handler optimistic
  path. Sticky for the session lifetime.
- Sessions sidebar swaps the chat-bubble icon for a cloud icon when
  isBackgrounded is true. The amber needsAttention pulse dot still
  composes on top.
- The chat panel renders a muted-info banner ("Running in background
  — you can close this tab and come back anytime") when the session
  is backgrounded.
- Mount-time hydration tolerates the get_session_info 404 that occurs
  for non-in-process sessions: when /messages succeeds but /session
  returns 404, infer isBackgrounded=true instead of falling through to
  markExpired. Reload-into-backgrounded-session now renders the banner
  rather than the catch-up flow.

Errors on the POST currently fall back to console.error — there is no
toast helper in the existing codebase. v1.1 follow-up.

Two backend gaps documented in .omc/specs/deep-dive-what-do-we-need.md
remain as v1.1 work: release_session_to_background only handles
Main-held sessions; get_session_info doesn't consult Mongo for
foreign-held sessions. Neither blocks v1 of this button (button is
hidden for Worker-held sessions; reload tolerance covers the second).

10 files changed, ~80 LOC.
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 2, 2026

Claude finished @akseljoonas's task in 1m 39s —— View job


PR Review: Background sessions on Mongo control plane

Tasks

  • Gather context (CLAUDE.md, REVIEW.md)
  • Check P0: _enforce_gated_model_quota crash with session=None stub
  • Check for any new issues in the latest commit
  • Post findings

1 P0
Verdict: changes requested

What I checked

  • _enforce_gated_model_quota (routes/agent.py:139–170) — session null guard
  • Stub AgentSession construction in ensure_session_loaded (session_manager.py:896–933) — session=None, claude_counted default
  • POST /submit and POST /chat/{session_id} call sites (routes/agent.py:598–599, 653–654)
  • Latest commit cd65715 (frontend-only: "Run-in-background button") — no new backend changes

Acknowledged fixes from prior review

The prior P0 finding from the last review round (the _enforce_gated_model_quota crash) is not yet addressed. Commit 07f0a85 ("Fix P0/P1 lease-handover correctness issues") touched session_persistence.py, session_manager.py, and tests — but not routes/agent.py.


P0 — _enforce_gated_model_quota crashes on stub session (unchanged from prior review)

ensure_session_loaded still returns a stub with session=None at session_manager.py:922. _enforce_gated_model_quota at routes/agent.py:156 unconditionally dereferences agent_session.session.config.model_name:

# routes/agent.py:154-156
if agent_session.claude_counted:
    return
model_name = agent_session.session.config.model_name  # AttributeError when session=None

The stub's claude_counted is not set in the constructor (session_manager.py:920–933), so it defaults to False (session_manager.py:112), and the early-return guard at line 154 does not fire.

Affected paths:

  • POST /api/submit (routes/agent.py:598–599) — user sending text while a Worker holds the session
  • POST /api/chat/{session_id} (routes/agent.py:653–654) — when text is not None and not approvals

Both raise AttributeError → HTTP 500 for any submit to a Worker-held session.

Fix: add a session=None guard at the top of _enforce_gated_model_quota, before line 154:

if agent_session.session is None:
    return  # stub for remotely-held session; quota enforced by holder

Alternatively, populate claude_counted=bool(meta.get("claude_counted")) in the stub constructor at session_manager.py:920–933 so already-charged sessions hit the existing guard. The one-line guard is the safer and more self-documenting fix. Fix this →

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant